TITLE OF THE INVENTION 
A COMPUTATIONAL METHOD FOR PREDICTING PROTEIN INTERACTION



CROSS REFERENCE TO RELATED APPLICATIONS 
This application claims priority from U.S. Provisional 
Application No. 60/415,742, filed on October 3, 2002, entitled
THEMATICS: A SIMPLE COMPUTATIONAL METHOD FOR THE IDENTIFICATION 
AND CHARACTERIZATION OF ENZYME ACTIVE SITES, the whole of which is 
hereby incorporated by reference herein. 

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

Part of the work leading to this invention was carried out 
with United States Government support provided under grants from
the National Science Foundation, Grant Nos. MCB-0135303,
MCB-0074574 and CHE-9974642. Therefore, the U.S. Government has
certain rights in this invention. 

BACKGROUND OF THE INVENTION 
Knowledge of protein sequences and structures has burgeoned 
very recently as a result of genome sequencing (Birney et al. 
2001) and structural genomics efforts. In order to translate such 
information into tangible benefits for humankind, the next step is 
to develop methods that enable one to predict and to establish 
function from structure. This need becomes particularly acute as 
novel protein folds are discovered for which there are no proteins 
of similar structure with known function. 

For some classes of proteins, information about function can 
be inferred from the evolutionary history derived from sequence 
relationships (Lichtarge et al. 1996; Sjolander et al. 1998; Yao
et al. 2003), or by other homology and non-homology sequence
methods (Marcotte et al. Science 1999; Marcotte et al. Nature
1999; Marcotte 2000; Ramani et al. 2003). Combined analysis of
sequence and structure data gives more revealing clues about
function (Carter et al. 2001). Current methods to locate active
sites rely on analogies to related proteins of known
function (Babbitt et al. 1998; Fetrow et al. 2001; Fetrow et al.
1999; Fetrow et al. 1998; Hegyi et al. 1999; Skolnick et al. 2000;
Teichmann et al. 2001; Wallace et al. 1991), on searches for
clefts in the structure (Laskowski et al. 1996), or on
computational searches for binding sites by docking (Park et al.
2000) of selected sets of small molecules (Chen 2001; Chen et al.
2001; Dennis et al. 2002). Energetics (Elcock 2001) and
flexibility (Ma et al. 2001) can also predict functionally 
important sites. However, there is as yet no reliable method to 
identify active sites of enzymes or other interaction sites of 
proteins in the absence of biochemical data, even when the 
structure is known. 

BRIEF SUMMARY OF THE INVENTION 
The method of the invention, Theoretical Microscopic 
Titration Curves (THEMATICS), is a computational method that
predicts chemical and electrostatic properties of residues in
proteins and utilizes information contained in those predictions
to identify various interaction sites. The various interaction
sites may include, for example, cofactor binding sites, ligand
binding sites, catalytic (active) sites or recognition sites. The 
method of the invention identifies the ionizable residues in the 
protein with anomalous predicted titration behavior and searches 
for the clustering of those residues into putative interaction 
sites. Practicing the method of the invention requires only the 
structure of the subject protein (which may be deduced, a priori, 
from the amino acid sequence) and, thus, may be applied to 
proteins that bear no similarity in structure or sequence to any 
previously characterized protein. To predict functional 
information from the primary sequence, one starts with the primary
amino acid sequence as input and predicts the three-dimensional
structure theoretically, using a modeling method including but not
limited to comparative modeling, homology modeling, and threading.
Then, one applies the method of the invention to the theoretical
three-dimensional structure as described.

In practicing the method of the invention, the mean net 
charge C, averaged over the ensemble of protein molecules, is 
determined as a function of pH for each of the ionizable residues 
in the protein structure. The resulting C(pH) curves represent
the theoretical microscopic titration curves for each ionizable
species in the protein structure. The majority of these C(pH)
curves in a given protein structure have the typical sigmoidal
shape, as predicted by the Henderson-Hasselbalch equation.
However, a residue in a given protein may show an unusual shape in
the predicted C(pH) function, indicating that partial protonation
persists over a wide pH range. The identification of two or more 
residues with unusually shaped or non-sigmoidal predicted C(pH) 
curves is utilized to identify a positive cluster, or a predicted 
site of interaction in a protein. 

Thus, in a preferred embodiment, the invention is directed
to obtaining a three-dimensional structure of the protein,
calculating an electrical potential function for the protein
structure, calculating a titration curve for each ionizable
residue in the protein, evaluating the shape of the titration
curve for each residue, identifying any residue with a perturbed
titration curve in comparison with other residues of the same
kind, and identifying any residues with perturbed titration curves
that are in a cluster. The existence of a cluster indicates that
the residues in the cluster are at an interaction site in the
protein. The method may further include the step of identifying any
residue with a perturbed titration curve that does not fall into
any cluster.

BRIEF DESCRIPTION OF THE FIGURES
Other features and advantages of the invention will be 
apparent from the following description of the preferred 
embodiments thereof and from the claims, taken in conjunction with 
the accompanying drawings, in which: 

Fig. 1 shows titration curves, i.e., the predicted mean
net charge as a function of pH, for tyrosine residues Y225-Y284 of
the A-chain of alanine racemase (Y225 (■), Y239 (O), Y265 (A),
Y269 (V), Y284 (4));

Fig. 2 shows titration curves for histidine residues in the
A-chain of triosephosphate isomerase (His-26 (+), His-95 (x),
His-100 (*), His-115 (n), His-185 (B), His-195 (0), His-224 (•),
and His-248 (A));

Fig. 3 shows a ribbon diagram of the triosephosphate isomerase
protein structure, looking down the α/β barrel at the active site
(backbone is shown in light gray; active-site residues with
perturbed titration curves are shown in black and labeled);

Fig. 4 shows a ribbon diagram of the aldose reductase protein
structure, looking down the α/β barrel at the active site (NADPH
cofactor is shown in medium gray) (backbone is shown in light
gray; active-site residues with perturbed titration curves are
shown in black and labeled);

Fig. 5 shows titration curves for selected tyrosine residues
for aldose reductase (Tyr-39 (+), Tyr-48 (x), Tyr-177 (*),
Tyr-291 (0), and Tyr-309 ("));

Fig. 6 shows titration curves for selected lysine residues
for phosphomannose isomerase (Lys-100 (+), Lys-117 (x), Lys-128 (*),
Lys-136 (n), and Lys-153 (B));

Fig. 7 shows a ribbon diagram of phosphomannose isomerase,
looking down at the presumptive active site (zinc ion is shown as
a medium gray ball) (backbone is shown in light gray; active-site
residues with perturbed titration curves are shown in black and
labeled);

Fig. 8 shows titration curves for lysine residues from 144
through 355 of the A-chain of the template structure for aspartate
aminotransferase (K144 (+), K215 (x), K248 (*), K258 (□), K288
(■), K344 (O), K355 (•)); and

Fig. 9 shows a ribbon diagram of papain, showing the side 
chains of the THEMATICS positive residues C25 and H159 in the 
active site in red, the second cluster K17, K174 and Y186 in
yellow, and the isolated residues E52 and R96 each in blue.

DETAILED DESCRIPTION OF THE INVENTION 
The method of the invention is directed to predicting 
protein function from sequence and structure information, even in
the absence of biochemical data. Theoretical Microscopic Titration
Curves (THEMATICS) is a computational method that predicts the
chemical and electrostatic properties of residues in enzymes or
other proteins and uses that information to determine the
functional activity at the atomic and molecular level of a given
protein. THEMATICS makes use of theoretical microscopic titration
curves calculated from the electrical potential function for all
of the ionizable residues in the protein structure. THEMATICS
searches for residues with anomalous shapes in their theoretical
microscopic titration curves and then seeks clusters of such
residues in coordinate space. Most chemical reactivity in proteins
depends on residues that can function as either Brønsted or Lewis
acids and bases, and, therefore, such reactive residues are
expected to show anomalous titration behavior. In this manner,
THEMATICS locates regions within the protein structure where
chemical reactivity (or interaction of any kind) is likely to
occur. This method has proved to be a very reliable predictor of
the location of interaction site(s), given the structure of an
enzyme, whether that structure is known or determined
theoretically from the amino acid sequence of the protein. The
various sites of activity may include, but are not limited to,
catalytic sites, recognition sites, metal binding sites, cofactor
binding sites and ligand binding sites.

One of the advantages of the present method is its
simplicity. It is computationally fast, is not database dependent
and is amenable to automation for high-volume computational
screening. Thus, the method of the invention will be a
particularly useful tool for the biotechnology and drug
industries, e.g., for screening proteins for target sites for the
potential binding of drugs and other effector molecules.

This method requires only the structure of the subject 
protein and thus complements the database-dependent approaches. 
The protein does not have to bear any resemblance in sequence or 
structure to any previously characterized protein for THEMATICS to
be applicable. Based on the evidence presented, the method of the
invention is also applicable to proteins for which an
experimentally determined structure is unavailable, i.e., where a
theoretical structure is determined from the amino acid sequence
of the protein.

Therefore, in practice of the method of the invention, a 
three-dimensional structure of the protein is first obtained. This 
structure can be determined experimentally (generally by x-ray 
diffraction or NMR) or downloaded from a database. The structure 
can also be a theoretical model structure; in this case, the
initial input to the method is the primary sequence of the
protein. The three-dimensional structure of the protein is then
predicted theoretically, using a modeling method including but not
limited to comparative modeling, homology modeling, and threading.
Then, one applies the method of the invention to the theoretical
three-dimensional structure as described.

Once the model of the protein has been obtained, the 
electrical potential function is calculated for the protein
structure using standard methods in the art. There are a number of
programs available for carrying out such calculations. These
programs generally utilize a Poisson-Boltzmann (Yang et al. 1993;
Bashford et al. 1991; Warwicker et al. 1982; Antosiewicz et al.
1996) procedure and solve the equations with a finite difference
method or a finite element method. Exemplary programs that
calculate the electrical potential function for proteins include
UHBD (Madura et al. 1995), DelPhi
(http://www.accelrys.com/insight/Delphi_page.html), APBS (Adaptive
Poisson Boltzmann Solver, which uses a finite element method)
(Baker et al. 2001) and WHATIF, a molecular modeling and drug
design program (Vriend 1990).

Next, the titration curves for all of the ionizable residues 
in the protein are calculated, using standard procedures in the
art. Exemplary methods include, but are not limited to, the full
Boltzmann sum method (Bashford et al. 1991), a hybrid
Tanford-Roxby/Boltzmann method (Yang et al. 1993; Karshikoff 1995), and a
Monte Carlo sampling method (Beroza et al. 1991). Programs that
are available to perform these calculations include HYBRID
(Antosiewicz et al. 1996; Gilson 1993) and WHATIF (Vriend 1990).
One of these methods is used to calculate the mean net charge as a
function of pH for each ionizable group (all lysine, arginine,
aspartic acid, glutamic acid, histidine, tyrosine and cysteine
residues, and the N-terminus and C-terminus) of each protein.

Finally, the shapes of the predicted titration curves are
compared and evaluated for each residue of the same kind (e.g.,
among all lysine residues, identifying the specific one(s) with
anomalously shaped curves in comparison with all the rest). Most of
the curves have a sigmoidal shape, as predicted by the
Henderson-Hasselbalch equation, and the anomalous residues, whose predicted
titration curves have a non-sigmoidal shape, are identified.
These anomalous residues have elongated titration curves, such
that partial protonation is predicted to extend over a larger pH
range than for a more typical residue. The wide pH range is one
that exceeds the range of stability of most protein structures.
The titration functions are used as a diagnostic tool and not as an
implication that these extremes of pH could actually be achieved.

We have developed three ways of identifying the residues with
anomalous predicted titration behavior: (a) simple visual
inspection of the curves by human observation; (b) statistical
analysis of the curve features to identify the curves that are
outliers; and (c) automated classification. Residues with
anomalous predicted titration behavior are called THEMATICS
positive residues.

The anomalous residues located in physical proximity in the 
protein structure are then identified. A group of anomalous
residues in physical proximity is called a THEMATICS positive
cluster. Anomalous residues are in physical proximity when two or
more are nearest neighbors, or are surrounding the same cleft as
another anomalous residue, or have reactive atoms within a certain
cut-off distance of each other. Such residues, by definition,
belong to the same THEMATICS positive cluster. Distances between
residues that are considered to be in physical proximity may be,
e.g., less than 15 Å, preferably less than 10 Å, more preferably
less than 7 Å, and most preferably approximately 6 Å or less.
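
The cut-off criterion for physical proximity can be illustrated by the following minimal sketch, which is offered only as an example and is not one of the programs named in this disclosure. It assumes that each THEMATICS positive residue is represented by the coordinates of a single reactive atom and that two residues are linked when those atoms lie within the cut-off distance (6 Å here); the function name cluster_residues, the residue labels and the coordinates are hypothetical.

```python
import math
from itertools import combinations


def cluster_residues(coords, cutoff=6.0):
    """Group residues by single linkage: two residues share a cluster
    when their reactive atoms are within `cutoff` angstroms, directly
    or through a chain of such neighbors."""
    names = list(coords)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the representative of the group.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if math.dist(coords[a], coords[b]) <= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    clusters = [sorted(g) for g in groups.values() if len(g) >= 2]
    isolated = sorted(g[0] for g in groups.values() if len(g) == 1)
    return clusters, isolated


# Hypothetical reactive-atom coordinates (angstroms) for four positive residues.
positives = {
    "A": (0.0, 0.0, 0.0),
    "B": (4.0, 0.0, 0.0),
    "C": (8.0, 0.0, 0.0),
    "D": (30.0, 0.0, 0.0),
}
clusters, isolated = cluster_residues(positives)
```

In this toy input, A and C are 8 Å apart but both lie within 6 Å of B, so all three fall into one cluster, while D remains an isolated positive residue.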

Identification of positive clusters by visual inspection of 
the curves often works very well, as in the proof-of-principle
example shown in Figure 1. This figure, which presents theoretical
titration curves calculated by THEMATICS, shows the predicted
average net charge as a function of pH for tyrosine residues at
amino acid positions in the range 225-284 in the A chain of
alanine racemase, a bacterial enzyme that is a target for
antibiotics and that has a known active site. As can be seen, Y239
and Y269 are typical tyrosine residues with predicted titration
curves that fit the Henderson-Hasselbalch equation; Y225 has an
upshifted pKa, but it has a nearly sigmoidal shape; therefore,
none of these three is deemed positive and, indeed, none of these
is an active site residue in the known structure. The predicted
curve shapes for Y265 and Y284, however, are highly atypical and
non-sigmoidal; thus, these constitute positive (i.e., anomalous)
results. The flattened regions where partial protonation persists
over a wide pH range are indicative of a residue that is part of
an interaction site. Comparison with a known structure of alanine
racemase that contains a substrate-mimic inhibitor reveals that
these two residues are, indeed, in the active site; Y265 is a
catalytic base and Y284 is involved in substrate recognition (Shaw
et al. 1997; Stamper et al. 1998). Y265 and Y284 are part of the
known active site cluster [R219, C311, K39, Y43, Y265, Y284, Y354,
C358, Y164] that the THEMATICS method correctly predicts for
alanine racemase.

A more accurate way of identifying residues in a positive 
cluster is by statistical analysis of the theoretical
titration curves. A typical ionizable residue in a protein obeys
the Henderson-Hasselbalch equation:

$$\mathrm{pH} = \mathrm{p}K_a + \log_{10}\frac{[\mathrm{A}^-]}{[\mathrm{HA}]} \qquad (1)$$

Thus, as pH is increased, the predicted average charge falls
sharply in a pH range close to the pKa. (The pKa of a residue is
defined as the pH at which half of the protein molecules in an
ensemble have that residue in protonated form HA and half in
deprotonated form A^-.) The ionizable species goes from fully
protonated to fully deprotonated in a narrow pH range.

To analyze the calculated titration curves, the
Henderson-Hasselbalch equation, equation (1), is rewritten to express the
mean net charge C (for a given residue in an ensemble of protein
molecules) as a function of pH, as:

$$C(\mathrm{pH}) = \frac{10^{\mathrm{p}K_a}}{10^{\mathrm{pH}} + 10^{\mathrm{p}K_a}} \qquad (2)$$

for the residues that form a cation upon protonation (Arg, His,
Lys and the N-terminal amino group). Equation (1) is written as:

$$C(\mathrm{pH}) = -\frac{10^{\mathrm{pH}}}{10^{\mathrm{pH}} + 10^{\mathrm{p}K_a}} \qquad (3)$$

for the residues that form an anion upon deprotonation (Asp,
Cys, Glu, Tyr and the C-terminal carboxylate group). Equations
(2) and (3) both have the familiar sigmoid shape that is typical
of a weak acid or base that obeys the Henderson-Hasselbalch
equation.
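
As an illustration only, equations (2) and (3) can be evaluated directly. The sketch below is a minimal example and not the titration procedure of the invention, which derives C(pH) from the electrical potential function; the function names charge_cation and charge_anion and the pKa values used are hypothetical.

```python
def charge_cation(pH, pKa):
    """Equation (2): mean net charge of a group that is cationic when
    protonated (Arg, His, Lys, N-terminus). Falls from +1 to 0."""
    return 10.0 ** pKa / (10.0 ** pH + 10.0 ** pKa)


def charge_anion(pH, pKa):
    """Equation (3): mean net charge of a group that is anionic when
    deprotonated (Asp, Cys, Glu, Tyr, C-terminus). Falls from 0 to -1."""
    return -(10.0 ** pH) / (10.0 ** pH + 10.0 ** pKa)


# Hypothetical pKa values, chosen only to exercise the formulas.
half_protonated_his = charge_cation(6.5, 6.5)   # pH equal to pKa
half_ionized_tyr = charge_anion(10.0, 10.0)     # pH equal to pKa
mostly_protonated = charge_cation(4.5, 6.5)     # two pH units below pKa
```

At pH equal to the pKa both expressions give a charge of magnitude one half, and two pH units below the pKa the cationic group is about 99% protonated, which is the narrow sigmoidal transition described above.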

One of the metrics that is used to characterize the 
titration curves is the number of pH units over which the 
residue is partially protonated in the range 0.01 < C < 0.99. 
This number of pH units, P, may be defined as:

$$P = \mathrm{pH}_{C=0.01} - \mathrm{pH}_{C=0.99} \qquad (4)$$

P has the nominal value of 3.99, $P_{HH} = 2\log_{10}(99) = 3.99$, for a
residue that obeys equation (2) (an ordinary Henderson-Hasselbalch
residue). P is evaluated for all of the ionizable residues in each
protein. Statistical analysis of the P values for all of the
ionizable residues then identifies the residues that have a
statistically significant deviation from the average P.
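
The metric P of equation (4) can be checked numerically. The following is a minimal sketch, assuming a monotonically decreasing curve of the form of equation (2) and a simple bisection search; the function names and the two-site "elongated" curve are hypothetical and serve only to show that a flattened curve yields a larger P than the nominal 3.99.

```python
import math


def hh_cation(pH, pKa):
    """Equation (2) written as 1 / (1 + 10**(pH - pKa))."""
    return 1.0 / (1.0 + 10.0 ** (pH - pKa))


def ph_at_charge(curve, target, lo=-10.0, hi=30.0, tol=1e-10):
    """Bisection for the pH at which a decreasing curve equals `target`."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if curve(mid) > target:
            lo = mid   # still above the target, so move to higher pH
        else:
            hi = mid
    return 0.5 * (lo + hi)


def width_P(curve):
    """Equation (4): P = pH(C = 0.01) - pH(C = 0.99)."""
    return ph_at_charge(curve, 0.01) - ph_at_charge(curve, 0.99)


# Ordinary Henderson-Hasselbalch residue with a hypothetical pKa of 6.5.
P_hh = width_P(lambda pH: hh_cation(pH, 6.5))

# Hypothetical elongated curve: an average of two transitions four pH
# units apart, used only to mimic persistent partial protonation.
P_wide = width_P(lambda pH: 0.5 * (hh_cation(pH, 5.0) + hh_cation(pH, 9.0)))
```

For the ordinary residue the computed width agrees with 2 log10(99), independent of the pKa chosen, while the elongated curve gives a markedly larger value.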

We can also fit the predicted titration curves to sigmoid
functions of the form:

$$C(\mathrm{pH}) = \frac{10^{s}}{10^{r\,\mathrm{pH}} + 10^{s}} \qquad (5)$$

or:

$$C(\mathrm{pH}) = -\frac{10^{r\,\mathrm{pH}}}{10^{r\,\mathrm{pH}} + 10^{s}} \qquad (6)$$

where r and s have the nominal values of 1.0 and the pKa,
respectively, for a Henderson-Hasselbalch residue. We then obtain
the best-fit values for r and s. For instance, one can use a
Levenberg-Marquardt procedure in which the quality of the fit is
measured by chi-squared, χ². As one might expect, for the
perturbed residues, r tends to deviate from 1.0, s tends to
deviate from the pKa, and χ² tends to be large (indicative of a
poor-quality fit). The values P, r, s and χ² for each ionizable
residue can be used as measures of the degree of perturbation, or
deviation, from Henderson-Hasselbalch behavior. Again, statistical
analysis of r, s and χ² for all of the ionizable residues
indicates which residues deviate from normal behavior and,
therefore, points to the anomalous residues. A variation of the
above procedure uses a one-parameter sigmoidal function, where r
is the one parameter and x = pH - pKa is the independent variable.
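
The two-parameter fit can be sketched as follows. This example does not use the Levenberg-Marquardt procedure mentioned above; as a simpler stand-in it assumes the curve has the form of equation (5) and uses the rearrangement log10((1 - C)/C) = r·pH - s, so that an ordinary least-squares straight line gives r as the slope and -s as the intercept. The function name fit_r_s and the synthetic curves are hypothetical.

```python
import math


def fit_r_s(ph_values, charges):
    """Estimate r and s of equation (5) from sampled (pH, C) points by a
    straight-line fit of log10((1 - C) / C) against pH."""
    points = [(ph, math.log10((1.0 - c) / c))
              for ph, c in zip(ph_values, charges) if 0.0 < c < 1.0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    r = sxy / sxx                 # slope of the line
    s = r * mean_x - mean_y       # intercept of the line is -s
    return r, s


# Synthetic curves sampled from pH 0 to 14 in steps of 0.5.
ph_grid = [0.5 * i for i in range(29)]
typical = [1.0 / (1.0 + 10.0 ** (ph - 6.0)) for ph in ph_grid]            # r = 1, s = 6
flattened = [1.0 / (1.0 + 10.0 ** (0.4 * ph - 2.4)) for ph in ph_grid]    # r = 0.4, s = 2.4

r_typ, s_typ = fit_r_s(ph_grid, typical)
r_flat, s_flat = fit_r_s(ph_grid, flattened)
```

On these noise-free synthetic curves the fit recovers r = 1.0 for the typical residue and r = 0.4 for the flattened one, showing how a value of r well below 1.0 flags an elongated curve. A nonlinear fit would be preferable for curves that do not follow equation (5) closely, since it also yields a χ² measure of misfit.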

Another valuable set of metrics that we have developed for 
the characterization of the titration curves involves a first 
derivative function f, defined as: 

$$f = -\frac{dC}{dx}, \qquad (7)$$

where x is the deviation from the pKa, x = pH - pKa. (Note that
the first derivative itself is usually negative, since charge
almost always decreases with increasing pH, so that f is
positive.) The n-th moment of a function f may be defined as:

$$\langle x^{n} \rangle = \int_{-\infty}^{\infty} x^{n} f \, dx \qquad (8)$$

We have analyzed the n-th moments of the first derivative function
f(x). We found that the second and higher moments, particularly
the fourth moment, are well correlated with perturbed titration
behavior and thus can be used to identify residues that are likely
to be in the active site.
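
The moments of equation (8) can be approximated numerically. The sketch below is a minimal illustration under stated assumptions: because f dx = -dC, each integral is replaced by a sum of midpoint values of x^n weighted by the drop in C across a small interval, and the infinite limits are truncated to ±15 pH units. The function names and the elongated test curve are hypothetical.

```python
import math


def moment(curve, n, x_min=-15.0, x_max=15.0, steps=6000):
    """Approximate <x^n> = integral of x^n * f dx with f = -dC/dx,
    using f dx = -dC on each small interval."""
    dx = (x_max - x_min) / steps
    total = 0.0
    for i in range(steps):
        x0 = x_min + i * dx
        x1 = x0 + dx
        x_mid = 0.5 * (x0 + x1)
        total += x_mid ** n * (curve(x0) - curve(x1))
    return total


def hh(x):
    """Henderson-Hasselbalch curve as a function of x = pH - pKa."""
    return 1.0 / (1.0 + 10.0 ** x)


def elongated(x):
    """Hypothetical flattened curve: two transitions four units apart."""
    return 0.5 * (hh(x + 2.0) + hh(x - 2.0))


m0_hh = moment(hh, 0)
m2_hh = moment(hh, 2)
m4_hh = moment(hh, 4)
m2_wide = moment(elongated, 2)
m4_wide = moment(elongated, 4)
```

For the Henderson-Hasselbalch curve, f is a logistic density with scale 1/ln 10, so the second moment should be close to π²/(3 ln² 10), about 0.62. The elongated curve gives larger second and fourth moments, consistent with the use of the higher moments as indicators of perturbed titration behavior.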

As no one single statistical criterion, however, can
identify the positive residues in all proteins, the best method
of analyzing the theoretical titration curves is by automated
classification. The automated classifier takes as input the
results of the statistical analyses described above, or,
alternatively, the predicted titration curves themselves, and
identifies the anomalous residues using an optimal combination
of weighted factors. Thus, the automated classifiers are more
powerful and more sophisticated than simple statistical
analysis. Two approaches have been developed at the present
time for automatic identification of interaction sites in
enzymes, given as input the spatial structure of the enzyme.
Both involve using trainable classifiers, including but not
limited to, neural networks and support vector machines.

In the first approach, the trained classifier is used to
replace the human visual inspection step as a means of identifying
perturbed curves corresponding to residues likely to be in the
active site. Various attributes of the curves are used as input to
the classifier, including those described above for the
statistical analysis process. The labeled training set for the
classifier is derived from proteins for which literature exists
identifying sites of activity.

The resulting classifier is then used on proteins for which
these data are not yet known, identifying likely candidates for
residues belonging to active sites. Finally, this stage is
combined with a second stage based on determining the physical
proximity of multiple active-site candidates, as described
above. The result is then an automatically determined THEMATICS
positive cluster, as described earlier.

In the second approach, an alternative one-step classifier 
is trained that takes as input both the data for all the
titration curves on the enzyme and the relative spatial
locations of all the corresponding residues. Training and
classification in this case requires devising attributes that
combine measures of curve perturbation of a given residue with
measures of the perturbation of all nearby residues, using the
spatial distances to the nearby residues. In the simplest case,
one can devise a single score to be assigned to each residue
that combines the perturbation score of that residue (as in the
first approach) with the perturbation scores of nearby residues
weighted according to a function that drops off with distance.
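
The single combined score of the simplest case can be sketched as follows. This is an illustration only: the disclosure does not specify the weighting function, so the exponential decay with a 6 Å length scale, the function name combined_scores, and the perturbation scores and coordinates below are all assumptions.

```python
import math


def combined_scores(perturbation, coords, length_scale=6.0):
    """Add to each residue's own perturbation score the scores of the
    other residues, weighted by exp(-distance / length_scale)."""
    scores = {}
    for res, own in perturbation.items():
        total = own
        for other, other_score in perturbation.items():
            if other == res:
                continue
            d = math.dist(coords[res], coords[other])
            total += other_score * math.exp(-d / length_scale)
        scores[res] = total
    return scores


# Hypothetical data: X and Y are close together, Z is far from both.
perturbation = {"X": 1.0, "Y": 1.0, "Z": 1.0}
coords = {"X": (0.0, 0.0, 0.0), "Y": (3.0, 0.0, 0.0), "Z": (40.0, 0.0, 0.0)}
scores = combined_scores(perturbation, coords)
```

Although all three residues carry the same individual perturbation score, X and Y reinforce one another and receive higher combined scores than the isolated residue Z, which mirrors the observation that true positives tend to cluster.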

An important potential variant of either approach is to 
allow varying amounts of human involvement, such as allowing a 
user to interactively set various thresholds or other 
classification parameters in order to fine-tune the results
produced by the otherwise automated program. This will be of
particular importance for more knowledgeable users, for whom 
these techniques can be used to provide automated assistance in 
their quest for deeper understanding of specific proteins, e.g., 
enzymes, they wish to study. 

For a given protein molecule, most of the residues giving
positive results form a cluster in physical space around the site
where enzymatic catalysis occurs. For each of the three featured
protein molecules with known active sites presented in Examples
I-III, the THEMATICS method correctly gives positive results for at
least two of the active site residues, and most of the positive
results are for what turn out to be active site residues. Less
often, positive results are obtained for a residue just outside
the active site; these residues are termed second shell residues.
These second shell residues, which are in physical proximity to
active site residues but are not immediate neighbors of the
reacting substrate, might well play important functional roles in
catalysis or recognition. Occasionally, an apparent false
positive, a residue with a perturbed theoretical titration
function that is not located near a known active site, is found.
However, these false positive residues tend to be isolated in
space and are thus distinguished readily from residues in positive
clusters. The majority of positive results do cluster together and
do occur in the active site, such that its location can be
identified with confidence. These trends hold true for a number of
types of enzymes, including six of the seven additional examples
of enzymes shown in Table 1.

Table 1. Positive Results for Some Additional Examples.

PDB ID  Name                                        Chemistry
        Positive results

1AMQ    Aspartate aminotransferase [44]             transamination
        [H189, Y225, K258, R266, C191, C192, Y256], [Y295], [H301]

1CSE    Subtilisin Carlsberg                        peptide hydrolysis (serine protease)
        [D32, H64]

1EA5    Acetylcholine esterase                      ester hydrolysis
        [Y130, E199, E327, H440, D5P2], [Y148], [H398], [H425]

1HKA    6-Hydroxymethyl-7,8-dihydropterin
        pyrophosphate kinase [44]                   kinase
        [D97, H115]

1OPY    Δ5-3-Ketosteroid isomerase                  isomerase
        [Y16, Y32, Y57], [C81]

1PIP    Papain                                      peptide hydrolysis (Cys protease)
        [C25, H159], [K17, K174, Y186], [R59], [R96]

1PSO    Pepsin                                      peptide hydrolysis (acid protease)
        [D32, D215, D303], [D11]

1WBA    Winged bean albumin                         storage - no known chemistry
        No positive results

Positive results that form a cluster in coordinate space are shown in brackets. Active site residues
are shown in boldface. Second shell residues are shown in italics.

For triosephosphate isomerase (TIM) (Example I), the method
of the invention yields four positive results. Two of the four
(His95 and Glu165) are active site residues known to be involved
in catalysis. The third positive result, that for Tyr164, is
located right next to the active site, in what might be thought of
as the second shell surrounding the reacting substrate. A fourth
residue, Lys112, is a false positive. Seven positive results were
found for aldose reductase (AR) (Example II): Tyr48 is an active
site residue and is known to be involved in catalysis; Cys298,
Lys77 and Tyr209 are known active site residues; Glu185 and
Lys21 are located just outside the active site, again in the
second shell; and Tyr107 we consider to be a false positive. For
phosphomannose isomerase (PMI) (Example III), four positive
results were found: three (Lys100, Lys136 and Tyr287) are located
in the presumed active site; one (His135) is in the second shell
of the presumed active site. The total number of ionizable
residues equals 77 per chain for TIM, 103 for AR and 134 for PMI.
For these proteins, therefore, about 3-7% of the ionizable
residues yield a positive result under the present definition.
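
The stated range of about 3-7% follows directly from the counts given above, as the short arithmetic check below shows. The counts are those reported in the text; the variable names are illustrative.

```python
# (positive results, ionizable residues) for each featured protein.
counts = {"TIM": (4, 77), "AR": (7, 103), "PMI": (4, 134)}

# Percentage of ionizable residues that give a positive result.
percent_positive = {
    name: 100.0 * positives / total
    for name, (positives, total) in counts.items()
}

for name, value in percent_positive.items():
    print(f"{name}: {value:.1f}%")
```

The fractions come to roughly 5.2% for TIM, 6.8% for AR and 3.0% for PMI.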

The calculated pKa value for His95 of TIM is consistent with
15N NMR data (Lodi et al. 1991). The calculated pKa of 8.4 for
Tyr48 of AR is identical to the value reported by Bohren et al.
(Bohren et al. 1994). All of the calculations reported
here have been performed in the absence of cofactors and substrate
mimics, in order to simulate conditions where a structure is built
from genomic data and knowledge of the location or identity of
such species in the protein structure may not be available. The
presence of the NADPH cofactor in AR will cause some shifting of
the pKa values of adjacent residues. Furthermore, the presence of a
substrate or substrate mimic can cause substantial shift in the
calculated pKa of an adjacent residue; this type of behavior was
reported recently for alanine racemase (Ondrechen et al. 2001).

A few positive results were found for residues that are
presumed to be just outside the active site, or in the second
shell. These residues are Tyr164 in TIM, Glu185 and Lys21 in AR
and His135 in PMI. It is undetermined whether these residues have
perturbed titration functions because they actually participate in
catalysis and/or recognition, or because they are simply
physically very close to the area where catalysis occurs. In the
latter case, they would presumably be subject to the same
pH-dependent potentials as the species that are engaged in catalysis.
Three out of four of these residues are conserved (Gish et al.
1993). Tyr164 in TIM and His135 in PMI are conserved across a
range of species. A ClustalW alignment
(http://www.ebi.ac.uk/clustalw) (Thompson et al. 1997) was run for
TIM sequences from chicken, human, mouse, Drosophila melanogaster,
Zea mays (maize) and Emericella nidulans. His95, Tyr164 and Glu165
are conserved in TIM from all of these species. A ClustalW
alignment (see supra) was run for PMI sequences from Candida
albicans, human, Saccharomyces cerevisiae, Emericella nidulans,
and Salmonella typhimurium. Lys100, His135, Lys136 and Tyr287 are
conserved across all of these species. Glu185 in human aldose
reductase is conserved over a number of human enzymes with some
sequence identity. A ClustalW alignment (see supra) was
run for sequences of the human enzymes aldose reductase, bile acid
binding protein, chlordecone reductase, 3-oxo-5-beta-steroid
reductase and aldehyde reductase. Tyr48, Glu185, Lys77 and Tyr107
are conserved across these enzymes. Cys298 and Lys21 are not
conserved. Tyr209 is conserved across all of the proteins except
chlordecone reductase, where it is replaced by a histidine. This
observed conservation lends support to the notion that these
residues do play functional roles, e.g., in catalysis,
recognition, catalytic efficiency or in the stabilization of the
protein fold.

The question why there are sometimes apparent false
positive results is an intriguing one. One possible explanation
is that proteins simply have strong electric fields and that
occasionally these pH-dependent fields, arising either from
long-range forces or from a strongly-coupled neighbor, will just
happen to cause perturbations in the titration functions that
resemble those of catalytic residues. Another possible
explanation is that these residues actually do have some
chemical function that has not yet been established. A third
possibility is that these residues are a part of a vestigial (or
incipient) active site that had (or will have) a function for
some past (or future) species. A fourth possibility is that they
are artifacts related to the quality of the input atomic
coordinates, namely, the interpretation of the electron density
maps in a region of some disorder may position residues
inaccurately.

THEMATICS calculations on an enzyme of unknown function
usually yield one cluster and this cluster is presumed to be the
active site. Thus one can identify the active site even if that
protein bears no resemblance to any previously characterized
protein. However, for a protein of unknown function, to obtain
further information about the function, to characterize the sites
in that protein, to determine what type of chemistry might be
catalyzed by that protein, and to establish where recognition
takes place in the protein structure, one needs to compare the
results with proteins of known function that have been
characterized. For instance, analysis of the results for the
protein Kex2 (described in Example IX, below) reveals a cluster
that contains the well-known catalytic triad of a hydrolase. One
can then hypothesize that the other positive residues that are not
a part of that catalytic triad are involved in recognition.




ATTORNEY DOCKET NO. NU-593XX 
WEINGARTEN, SCHURGIN, 
GAGNEBIN & LEBOVICI LLP 
TEL. (617) 542-2290 
FAX. (617) 451-0313 



The following examples are presented to illustrate the
advantages of the present invention and to assist one of ordinary
skill in making and using the same. These examples are not
intended in any way otherwise to limit the scope of the
disclosure. The contents of all references and pending and
published patent applications cited throughout this application
are hereby incorporated by reference.

Identification of the sites of interaction, using as examples
enzymes with known active sites (EXAMPLES I - III)

EXAMPLE I
Triosephosphate Isomerase (TIM)

TIM (Zhang et al. 1994; Coulson et al. 1970; Waley et al. 1970;
Hartman 1970; Komives et al. 1991; Lodi et al. 1991) catalyzes the
isomerization of D-glyceraldehyde 3-phosphate (GAP) to
dihydroxyacetone phosphate (DHAP), a key reaction in the
glycolytic pathway, and has the α/β barrel ("TIM barrel") fold.
Calculations on TIM from Gallus gallus (chicken) were performed
using the 1.80 Å resolution structure PDB code 1TPH (Zhang et al.
1994; Berman et al. 2000) of the TIM-phosphoglycolohydroxamate
complex. Theoretical titration functions were obtained for the
biologically active dimer from which the coordinates for the
inhibitor were removed.

Inspection of the theoretical titration functions for
all of the ionizable residues of TIM indicated that four residues
have curves with highly perturbed shapes: His95, Glu165, Lys112
and Tyr164. Fig. 2 shows the predicted mean net charge as a
function of pH for the eight histidine residues in the A chain.
Predicted titration functions for the B chain (not shown) were
similar to those for the A chain. Fig. 3 shows the location of
His95, Glu165 and Tyr164 in the active site of TIM.

The theoretical titration functions for most of the
histidine residues depicted in Fig. 2 have the typical shape.
However, it was observed that His95 has a strikingly different
predicted titration function. Not only was the pKa of the
conjugate acid downshifted to 3.2, but also the shape of the curve
was perturbed; in particular, the curve for His95 has a flat
region where the mean net charge stays nearly constant over a few
pH units (about pH -2.0 to +1.5). Thus, His95 was predicted to
stay partially protonated over a wide pH range. This change in
shape of the titration curve was the most significant difference
between His95 and the other histidine residues. The predicted mean
net charge for His95 is 0.10, 0.03 and 0.00 for pH 5.0, 6.0 and
7.0 respectively, consistent with experimental data, as discussed
below. Similarly, the predicted titration curve for Glu165 was
different from the other glutamates. Again there was an unusual,
nearly flat region in the curve (at about pH 1.0 to 4.0) and the
residue is predicted to be partially protonated over an uncommonly
wide pH range, with a downshifted pKa of -0.5. A third residue,
Lys112, was found to have a flat region of partial protonation in
the pH range 11-14, with an upshifted pKa of about 16.5 for its
conjugate acid. A fourth residue, Tyr164, was found to have a
nearly flat region of partial ionization in the theoretical
titration curve (at about pH 13.0 - 16.0). It was predicted to
remain uncharged up to pH 13 with predicted charges of -0.01,
-0.03, and -0.10 at pH 13.0, 15.0, and 17.0, respectively, and a
calculated pKa of 18.2. Thus it was predicted to have an unusually
high pKa. Furthermore, the shape of the predicted titration curve
was noticeably different from those of the other tyrosine
residues.
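
To make the notion of a "typical" curve concrete, the following
minimal Python sketch compares the mean net charges quoted above
for His95 with the ideal Henderson-Hasselbalch curve of a single,
non-interacting cationic site of the same pKa. It is an
illustration only: the function and variable names are
hypothetical, and the ideal-curve formula is the textbook
expression, not the THEMATICS calculation itself.

```python
def hh_charge_cationic(ph: float, pka: float) -> float:
    """Mean net charge of an ideal, isolated cationic acid (e.g. His, Lys).

    Textbook Henderson-Hasselbalch form: +1 when fully protonated, 0 when not.
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# Values quoted in the text for His95 of TIM (calculated pKa 3.2).
PKA_HIS95 = 3.2
PREDICTED_HIS95 = {5.0: 0.10, 6.0: 0.03, 7.0: 0.00}


def compare_to_ideal(pka, predicted_charges):
    """Return (pH, predicted, ideal, predicted - ideal) tuples, sorted by pH."""
    rows = []
    for ph, q_pred in sorted(predicted_charges.items()):
        q_ideal = hh_charge_cationic(ph, pka)
        rows.append((ph, q_pred, q_ideal, q_pred - q_ideal))
    return rows


if __name__ == "__main__":
    for ph, q_pred, q_ideal, excess in compare_to_ideal(PKA_HIS95, PREDICTED_HIS95):
        print(f"pH {ph:4.1f}  predicted {q_pred:5.2f}  "
              f"ideal {q_ideal:7.4f}  excess {excess:+.4f}")
```

At pH 5.0 the ideal curve gives a charge of about 0.016, whereas
the predicted value is 0.10; this excess is one way to see the
extended tail of partial protonation described above.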

Three of the four residues with perturbed titration curves
are in close spatial proximity: His95, Glu165 and Tyr164. This
region of physical space correlates with biochemical evidence for
the location of the active site. Structural evidence suggests
(Zhang et al. 1994) that the two residues that are active in
acid/base catalysis are Glu165 and His95. Earlier affinity-
labeling experiments established the side chain of Glu165 as the
catalytic base (Coulson et al. 1970; Waley et al. 1970; Hartman
1970). Spectroscopic evidence suggested that an electrophilic
residue was involved in the catalysis (Komives et al. 1991) and
the x-ray crystal structure revealed His95 as the likely
electrophile (Zhang et al. 1994). 15N NMR spectra show that the
imidazole ring of His95 is substantially uncharged over the pH
range 4.9-9.9, implying that the pKa is less than 4.5 (Lodi et al.
1991). The predicted mean net charges and calculated pKa of 3.2
for His95 were consistent with the 15N NMR data. Tyr164 is located
right next to the active site. The x-ray crystal structure
indicates that it is pointing away from the precise location where
the substrate is believed to bind (Zhang et al. 1994), although
this does not necessarily preclude involvement in catalysis.
Lys112 is physically well removed from the active site and is
considered to be a false positive.

The above analysis pertains to four residues originally
identified by human observers as having anomalously shaped
titration curves. A subsequent statistical analysis of the P
values (as defined, above, by Equation 4) and the moments of the
function f (Equation 7) identified a slightly different set of
outliers: His95, Glu165, C126, and Y164. These four residues
form a single cluster. Again, His95 and Glu165 are catalytic
residues while C126 and Y164 are second shell residues. The
important point here is that, while the list of positive residues
is slightly different depending on the type of analysis used, the
catalytic site and the important catalytic residues for this known
active site are identified by the method of the invention.

EXAMPLE II
Aldose Reductase (AR)

AR (Harrison et al. 1994; Wilson et al. 1992; Bohren et al.
1994) is an NADPH-dependent enzyme that catalyzes the reduction of
the aldehyde group in an aldose to the corresponding alcohol.
Calculations were performed on the biologically active monomer.
The 1.76 Å resolution x-ray crystal structure 2ACS (Harrison et
al. 1994) of human aldose reductase from the Protein Data Bank
(Berman et al. 2000) was utilized. The structure 2ACS contains the
nicotinamide cofactor and a citrate ion; the calculation was
performed on the protein alone without the cofactor or ion. Figs.
3 and 4 show the relationship between the structures of TIM and
AR.

The theoretical titration functions for AR reveal seven
residues with unusual curves: Tyr48, Cys298, Glu185, Lys21, Lys77,
Tyr107 and Tyr209. Fig. 5 shows the predicted mean net charge as a
function of pH for five selected tyrosine residues that show
typical curves but do not obscure the curve for Tyr48. Tyr48 is
predicted to have an extended region of partial protonation (i.e.
non-integer mean net charge) in the pH range 12.5-17.5. The curve
for Tyr48 is nearly flat from about pH 12.0 to 15.0, with a long
tail on the higher pH end that crosses over curves of residues
with higher pKa. The pKa of Tyr48 is calculated to be 8.4, at
least a few pH units lower than the other tyrosine residues shown.
The low pKa and the flat region at the higher pH side of the curve
make Tyr48 partially protonated over the remarkably wide pH range
of 5.0 through 17.5. The other tyrosine residues in Fig. 5 have
sigmoidal, or nearly sigmoidal, theoretical titration curve
shapes, which are characteristic of an ordinary Henderson-
Hasselbalch residue. It is also observed that Cys298 has a wider
range of partial protonation than the other cysteine residues and
a flat region in the pH range 14-16. Its calculated pKa of 10.2
is similar to most of the other cysteine residues in this protein,
although a little higher than typically observed for a thiol group
free in solution. The calculated titration curve for Glu185
exhibits a nearly flat region in the pH range 0-3; the calculated
pKa of -1.7 is significantly downshifted from the other
glutamates. Calculated mean net charges for Glu185 are -0.94,
-0.97, -0.98, and -0.99 for pH 0.0, 1.0, 2.0, and 3.0,
respectively. The titration curve for Lys21 has a flat region of
non-integer net charge in the pH range 8-11, with a predicted pKa
of 13.4 for its conjugate acid. Lys77 has an extended flat region
of partial protonation in the pH range 9-17, with a heavily
upshifted predicted pKa of 19.6 for the conjugate acid. The
predicted curve for Tyr107 exhibits a flat region in the pH range
11-15 with non-integer mean net charge; the pKa is predicted to be
upshifted to 17.0. Finally, Tyr209 was found to have an extended
flat region of partial protonation in the pH range 9.5 - 15.5,
with a very upshifted predicted pKa of 17.9. An eighth residue,
His110, has an extended region of partial protonation but we did
not count it in the original visual analysis because the
differences in its titration curve are not as apparent as they are
for the seven residues discussed above. Statistical analysis
later identified His110 as an outlier, as described below.

Again, the majority of the residues with positive results
are close in space and the active site location indicated by the
theoretical titration functions is consistent with biochemical
data. The 1992 x-ray crystal structure at 1.65 Å resolution
suggested that Tyr48, His110 and Cys298 are the catalytically
active residues (Wilson et al. 1992). Site-directed mutagenesis
experiments showed that the Y48F mutant has no activity and the
H110Q and H110A mutants lose activity by 1*10^ - 2*10^ fold
(Bohren et al. 1994); thus it was concluded that Tyr48 is the
proton donor and His110 is involved in the catalysis. Reference
30 also reports pH profiles of kinetic data and from them infers
that the pKa of Tyr48 in the wild type enzyme is 8.4, with which
our calculated value is in good agreement. Four of the seven
residues with positive results, Tyr48, Cys298, Lys77 and Tyr209,
were all identified as active site residues from structural and
biochemical data (Harrison et al. 1994). Two more of the seven,
Glu185 and Lys21, are located right next to active site residues:
Lys21 is just behind active site residue Trp20 and also next to
the cofactor; Glu185 is located just behind active site residues
Tyr209 and Cys298. Tyr107 is located behind active site residue
Lys77, but is far enough away from the catalytic site that we
consider it to be a false positive.

The above analysis pertains to the residues originally
identified as positive by human observers. Subsequently,
statistical analysis added His110, a known catalytic residue, to
the list. His110 is in physical proximity to residues in the
active site cluster and, therefore, is a member of that cluster.
Thus, as will be shown in more detail in some of the following
examples below, statistical analysis provides a more complete list
of important residues.

EXAMPLE III
Phosphomannose Isomerase (PMI)

PMI (Cleasby et al. 1996; Gracy et al. 1968; Wells et al.
1994; Malaisse-Lagae et al. 1989) catalyzes the reversible
interconversion of mannose-6-phosphate and fructose-6-phosphate.
PMI is a metal-dependent enzyme containing one atom of zinc per
protein molecule. The 1.70 Å resolution x-ray crystal structure
1PMI (Cleasby et al. 1996), the enzyme from Candida albicans,
was used in the calculations here. The zinc ion was included in
the calculation.

Upon examination of the theoretical titration functions for
PMI, we find four residues with perturbed curves that possess a
flat or nearly flat region: His135, Lys100, Lys136, and Tyr287.
The titration curve for another residue, Glu294, is obviously
perturbed in that its slope is considerably less steep than
that of the other glutamates. It is predicted to be partially
protonated over the very wide range of about pH -2.0 through 7.0
with a pKa of 2.5. Specifically, it lacks a flat region of
partial protonation but instead has an unusually shallow, nearly
constant, slope over a wide pH range. For present purposes, we
are interpreting the curves in conservative fashion and are
excluding Glu294 from the positive list.

Fig. 6 shows the predicted mean net charge as a function of
pH for five selected lysine residues in PMI. The lysine residues
were selected to show some with typical curves (Lys117 and
Lys128), one with a slightly perturbed slope (Lys153), and the two
curves with unusual shapes (Lys100 and Lys136). Lys100 exhibits a
flat region of non-integer mean net charge at about pH 11-14; its
conjugate acid is predicted to have a high pKa of 16.4. Lys136
has a more typical pKa (of 12.5 for the conjugate acid) but it has
a tail on the high pH end of its titration curve, with a flat
region at about pH 16.0 - 17.5. Also, His135 has a flat region of
partial protonation in the pH range 1-4; its predicted pKa of 5.8
is a typical value for the conjugate acid of a histidine residue.
Tyr287 has a flat region of non-integer mean net charge in the pH
range 14.5 - 17.0 and has a very unusually high pKa (>19).

The precise mechanism of action of PMI is not yet known.
However, because it is a metalloenzyme and the metal ion is
essential for activity (Gracy et al. 1968), the active site is
presumed to be located in an observed cleft near the zinc ion.
The structural data of reference 32, together with labeling
studies (Wells et al. 1994), identify six ionizable active site
residues that are not involved in coordination of the zinc ion:
Arg304, Glu48, Glu294, Lys100, Lys136, Tyr287. Fig. 7 shows the
structure and some of the presumed active site residues for PMI.


Experimental evidence supports a proton transfer mechanism via a
cis ene-diol intermediate (Malaisse-Lagae et al. 1989), rather
than by a hydride shift as for xylose isomerase (Lavie et al.
1994; Carrell et al. 1994).

Therefore, of the four residues with a positive result, all
are in close proximity and three residues, Lys100, Lys136 and
Tyr287, are in the presumed active site. His135 is located just
behind the presumed active site residue Lys136 and, therefore, is
a second shell residue.

Application of THEMATICS to theoretical model structures (EXAMPLES
IV - VII). In Examples IV - VII, we start with the primary
sequence and then build a theoretical model structure. Then
THEMATICS is applied on the theoretical model structure.

EXAMPLE IV

Triosephosphate Isomerase (TIM) orthologs

The first of the four structures homologous to the chicken
triosephosphate isomerase (TIM) structure 1TPH is built from the
sequence for Schistosoma japonicum (Liu et al. 2003) with 60%
sequence identity in the pairwise alignment and 0.16 Å RMSD value
for the model structure. The second model is determined for the
sequence for Enterococcus faecalis (Paulsen et al. 2003) with
40.2% sequence identity, resulting in a 0.29 Å RMSD value with the
template structure. The third model is built from the sequence of
Bartonella henselae with 38.7% identity and RMSD value of 0.31 Å,
while the last model is built from the sequence of Mycoplasma
genitalium (Fraser et al. 1995) with 33% identity and RMSD value
of 1.73 Å. These structures are all obtained with MODELLER and are
summarized in Table 2 (shown below).

Table 2 gives the THEMATICS result for the active site
cluster for each template structure and the orthologous model
structures. Known active site residues are shown in boldface and
second shell residues (those immediately adjacent to active site
residues but not considered to be in the active site) are
underlined.

For the TIM structure from chicken (1TPH), four neighboring
residues with anomalous titration behaviour are identified as the
active site cluster. Two of these residues, H95 and E165, are
well established by experiment as catalytically active residues
(Lodi et al. 1991; Zhang et al. 1994).

Two other residues, C126 and Y164, are located in the active
site cleft but any possible catalytic role for these residues has
not been investigated experimentally. Upon alignment of the
sequences and superposition of the structures, it is confirmed
that all four of these residues are conserved, both in the
sequence and in the spatial arrangement of the active site cleft,
in all of the four model structures. Sequence alignment across a
wider range of species again reveals high conservation of all four
of these residues.

For all four of the model structures, THEMATICS finds the
active site. THEMATICS identifies (by pronounced perturbed shape
of the predicted titration curves) the two catalytic residues H95
and E165, plus C126, in all four of the model structures. For two
of the four models, S. japonicum and B. henselae, Y164 also
exhibits perturbed titration behaviour. Y164 does not show
significant perturbed titration behaviour in the model structures
for E. faecalis and M. genitalium. Thus THEMATICS identifies the
correct active site cluster for all of the model structures, but
Y164 is not always included in the predicted active site cluster.
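
The comparison between the template cluster and the model
clusters can be tabulated with a few set operations. The Python
sketch below is illustrative only: the names TEMPLATE_CLUSTER,
MODEL_CLUSTERS and recovery are hypothetical, and the residue
lists are simply transcribed from the results described in this
example.

```python
# Active site cluster found for the chicken template structure.
TEMPLATE_CLUSTER = {"H95", "E165", "C126", "Y164"}
# Residues established experimentally as catalytic.
CATALYTIC = {"H95", "E165"}

# Clusters reported for the four model structures (template numbering).
MODEL_CLUSTERS = {
    "S. japonicum": {"H95", "E165", "C126", "Y164"},
    "E. faecalis": {"H95", "E165", "C126"},
    "B. henselae": {"H95", "E165", "C126", "Y164"},
    "M. genitalium": {"H95", "E165", "C126"},
}


def recovery(template, model):
    """Summarize how much of the template cluster a model cluster recovers."""
    shared = template & model
    return {
        "shared": sorted(shared),
        "missing": sorted(template - model),
        "extra": sorted(model - template),
        "fraction": len(shared) / len(template),
    }


if __name__ == "__main__":
    for species, cluster in MODEL_CLUSTERS.items():
        summary = recovery(TEMPLATE_CLUSTER, cluster)
        has_catalytic = CATALYTIC <= cluster
        print(f"{species:14s} fraction={summary['fraction']:.2f} "
              f"missing={summary['missing']} catalytic_found={has_catalytic}")
```

With these inputs, every model recovers both catalytic residues,
and the only template residue ever missing is Y164.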

EXAMPLE V

6-Hydroxymethyl-7,8-dihydropterin pyrophosphokinase
(HPPK) orthologs

6-Hydroxymethyl-7,8-dihydropterin pyrophosphokinase, HPPK, is
a monomeric pyrophosphate transferase with α-β plaits topology.
Its crystal structure for E. coli (PDB code 1HKA) was obtained
from the Protein Data Bank with 1.5 Å resolution (Xiao et al.
1999). Four models homologous to the E. coli structure 1HKA are
built using the MODELLER software from the sequences for the
following organisms: Vibrio vulnificus (Rhee et al. 2002) (with
63% sequence identity with E. coli and 0.34 Å RMSD), Vibrio
parahaemolyticus (Makino et al. 2003) (with 57% sequence identity
and 0.22 Å RMSD), Pseudomonas aeruginosa (Stover et al. 2000)
(with 51% sequence identity and 0.36 Å RMSD) and Pseudomonas
putida (Nelson et al. 2002) (with 48% sequence identity and 0.50 Å
RMSD).

The predicted titration curves for residues D95, D97 and
H115 in the template E. coli structure do not show the usual
sigmoidal behaviour and are identified as the active site cluster
by statistical analysis. All three of these residues have been
identified as active site residues in an experimentally determined
inhibitor structure (Stammers et al. 1999). All of them are
conserved across the four species for which the model structures
are built and are also generally well conserved across bacterial
kinases. When the four model structures are overlaid onto the
template E. coli structure, the positions of these residues are
conserved in the active site pocket with similar orientations.

For the HPPK case, THEMATICS identifies the same cluster for
all four of the model structures as for the E. coli template
structure (see Table 2, shown below).

EXAMPLE VI

Aspartate aminotransferase (AspAT) orthologs

The structure of the pyridoxamine 5'-phosphate (vitamin B6)
dependent enzyme aspartate aminotransferase (AspAT) from E. coli
at 2.2 Å resolution (PDB code 1AMR) (Miyahara et al. 1994) is used
as the template. Its fold is a unique aminotransferase fold.
AspAT is active as a homodimer and the calculations are performed
on the dimer structure.

Using MODELLER software (Fiser et al. 2000), four model
structures homologous to the AspAT template from E. coli are
constructed from the sequences for the following organisms: Vibrio
cholerae (Heidelberg et al. 2000) (with 62% pairwise identity and
1.52 Å RMSD), Oryza sativa (Sasaki et al. 2001) (with 44% identity
and 0.64 Å RMSD), Neisseria meningitidis (Tettelin et al. 2000)
(41% identity and 1.28 Å RMSD), and Clostridium perfringens (22%
identity and 3.67 Å RMSD).

Figure 8 illustrates the predicted titration curves for some
of the lysine residues of the template structure for E. coli.
Predicted mean net charge as a function of pH is shown for all of
the lysine residues in the sequence between 144 and 355
(inclusive) of the A chain of the homodimer. The titration curves
are given for K144 (+), K215 (x), K248 (*), K258 (□), K288 (■),
K344 (O), and K355 (•). Note the elongated, highly non-
sigmoidal shape of the catalytic lysine residue K258. The other
lysine residues all have sigmoidal or nearly sigmoidal shape, with
the characteristic sharp fall-off in charge in the region where
the charge is approximately equal to one-half.

For the template E. coli structure, THEMATICS finds a
cluster of residues with perturbed titration behaviour, consisting
of the following: H189, Y225, K258, R266, C191 and C192. It has
been established experimentally that K258 is the catalytic base
that initiates transamination, that H189, Y225 and R266 are also
in the active site pocket, and that nearby residues outside the
active site, such as C191, also play a role in the catalytic
activity (Miyahara et al. 1994; Jeffery et al. 1998; Jeffery et
al. 2000; Mizuguchi 2001). When sequence alignment is performed
on all of the model sequences and the E. coli sequence, the
identified residues are all conserved, except that C191 and C192
are not present in Clostridium perfringens. Superposition of the
model structures onto the template structure reveals that the
identified residues are located in the same region of the protein
with similar orientations.

For all four models and for the template, THEMATICS finds
the active site cluster, although the list of identified residues
is a little different for each species for the AspAT case (see
Table 2, shown below).



Table 2. Summary of Orthologous Model Structures and Results

Enzyme  Species              % Identity  RMSD   THEMATICS Result

TIM     S. japonicum         60%         0.16   [H95, E165, C126, Y164]
        E. faecalis          40%         0.29   [H95, E165, C126]
        B. henselae          39%         0.31   [H95, E165, C126, Y164]
        M. genitalium        33%         1.73   [H95, E165, C126]
        Template structure:
        Chicken (1TPH)                          [H95, E165, C126, Y164]

HPPK    V. vulnificus        63%         0.34   [D95, D97, H115]
        V. parahaemolyticus  57%         0.22   [D95, D97, H115]
        Ps. aeruginosa       51%         0.36   [D95, D97, H115]
        Ps. putida           48%         0.50   [D95, D97, H115]
        Template structure:
        E. coli (1HKA)                          [D95, D97, H115]

AspAT   V. cholerae          62%         1.52   [H189, Y225, K258, C191, C192]
        Oryza sativa         44%         0.64   [H189, Y225, K258, R266, C191,
                                                 Y295]
        N. meningitidis      41%         1.28   [H189, Y225, K258, C191, C192]
        C. perfringens       22%         3.67   [H189, Y225, K258, R266]
        Template structure:
        E. coli (1AMR)                          [H189, Y225, K258, R266, C191,
                                                 C192, Y256]

For each model structure, % pairwise identity with the template and the RMSD value in Å are
given. THEMATICS results for the active site cluster are given with known active site residues
shown in boldface and second shell residues underlined. Sequence numbers for the models are
adjusted to match those of the template structures.



EXAMPLE VII
Human homologues of aldose reductase

Aldose reductase is an NADPH-dependent enzyme that catalyses
the reduction of the aldehyde group in an aldose to an alcohol.
It has the α/β barrel ("TIM barrel") fold and is active as a
monomer. The x-ray crystal structure data are obtained from the
Protein Data Bank (PDB code 2ACS) with 1.76 Å resolution (Harrison
et al. 1994).

Structures homologous to human aldose reductase are built
using SWISS-MODEL (Schwede et al. 2003). Model structures are
constructed using homologous human protein sequences and
the template structure 2ACS for aldose reductase. These model
protein structures all have the same fold as aldose reductase but
they perform different functions and catalyse different reactions.
The following model protein structures have been constructed:
aldehyde reductase (Wermuth et al. 1987) (with 50% pairwise
identity), bile acid binding protein (Stolz et al. 1993) (with 48%
identity), 3-oxo-5-beta steroid dehydrogenase (Kondo et al. 1994)
(50% identity), and chlordecone reductase (Qin et al. 1993) (48%
identity). The x-ray structures for aldehyde reductase (2ALR) and
bile acid binding protein (1IHI) are available from the PDB and
were used to check our results for the model structures.

THEMATICS computation on the template aldose reductase
structure identifies by statistical analysis the following cluster
of residues as important: C298, H110, K77, Y48, Y209, E185 and
K21. C298, H110, K77, Y48 and Y209 are known active site residues
while E185 and K21 are just behind the active site in the second
shell surrounding the reacting substrate (Harrison et al. 1994).
Results are summarized in Table 3. For the models, the residues
that occupy equivalent positions in the structure are vertically
aligned in Table 3. Only the residues that are predicted to have
perturbed titration behaviour are shown. As these enzymes have
different functions, not all of the residues in the THEMATICS
active site cluster are conserved, although K77, Y48 and E185 are
conserved.

THEMATICS correctly locates the active site cluster for
these human homologues of aldose reductase, in spite of some
sequence variability in the active site and in spite of
differences in function. Note that residues in equivalent
positions are sometimes identified by THEMATICS even when there is
amino acid substitution at that location.

It is also instructive to compare the THEMATICS results for
the model structures of aldehyde reductase and bile acid binding
protein with the results for the experimentally determined x-ray
crystal structures. For aldehyde reductase, five out of the seven
residues identified for the model structure are also identified
for the crystal structure (H112, K79, Y49, Y209 and E185); two of
the seven (K22 and Y297) are identified for the model structure
but not for the crystal structure. For bile acid binding protein,
the model structure and the crystal structure both identify five
active site cluster residues: E117, K84, Y55, Y216, and E192. K27
is identified for the model structure but not for the crystal
structure. Both structures give negative results for R304 (which
occupies the position corresponding to C298 in aldose reductase).



Table 3. Human Homologues of Aldose Reductase

Enzyme / % Identity                        THEMATICS result

Template:
Aldose reductase                           [C298, H110, K77, Y48, Y209, E185, K21]

Models:
Aldehyde reductase / 50%                    Y297  H112  K79  Y49  Y209  E185  K22
Bile acid binding protein / 48%             np1   E117  K84  Y55  Y216  E192  K27
3-oxo-5-beta steroid dehydrogenase / 50%    np2   H113  K80  Y51  Y212  E188  np3
Chlordecone reductase / 48%                 Y304  H116  K83  Y54  H215  E191  np4

THEMATICS results are given for human aldose reductase and four human homologues with
different chemical functions. Residues occupying the same position in the structure are aligned
vertically in the table. In the template, known active site residues are shown in boldface; second
shell residues are underlined. Only perturbed residues (those identified by THEMATICS) are
shown. np = the residue in that position is not perturbed and is not identified by THEMATICS;
np1 = R304; np2 = Y301; np3 = E24; np4 = E27.

EXAMPLE VIII

Use of THEMATICS finds two positive clusters in Papain

Papain is a cysteine protease from papaya with a hydrolase
fold. The monomeric structure 1PIP [Yamamoto et al. 1992] at 1.7
Å resolution was used as input.

Papain is one of the relatively unusual cases in which 
THEMATICS has found two positive clusters. One of these 
clusters, [C25, H159] , is the known protease active site. The 
cysteine C25 acts as a nucleophile to attack the carbon atom of 
the scissile bond. H159 acts as a proton shuttle. The function 
for the other cluster [K17, K174, Y186] is currently unknown. In 
addition, THEMATICS finds two isolated residues, E52 and R96, 
that have anomalous curve shapes. Figure 7 shows a ribbon 
diagram of papain with the side chains of the active site 
cluster [C25, H159] shown in red, the second cluster [K17, K174, 
Y186] shown in yellow, and the two isolated residues E52 and R96 
shown in blue. 

A sequence alignment was performed for plant cysteine
proteases from alder, Arabidopsis thaliana, barley, ice plant,
kiwi, rice, and sweet potato and compared with papain. Sequence
searches were performed using WU-Blast2 http://blast.wustl.edu/
[Gish et al. 1993] and sequence alignments were performed with
ClustalW http://www.ebi.ac.uk/clustalw [Thompson et al. 1994].
The overall sequence identity scores for these proteins with
papain are in the range 45-51%. Table 4 (shown below) shows the
alignment results for all of the residues determined to be
anomalous in papain by THEMATICS. Note how all five residues in
the two positive clusters are perfectly conserved across the
species, but the two isolated anomalous residues, E52 and R96,
are not perfectly conserved. E52 is conserved in all but two of
the species; R96 is not conserved.

Table 4. Sequence alignment results of the anomalous residues for papain across similar plant
proteases using THEMATICS. (Percent identity with papain is shown in parentheses.)

Species             [C25, H159]   [K17, K174, Y186]   [E52]   [R96]

Papaya               C    H        K    K     Y         E       R
Alder (48%)          C    H        K    K     Y         E       N
Arabidopsis (49%)    C    H        K    K     Y         Q       S
Barley (49%)         C    H        K    K     Y         E       N
Ice plant (45%)      C    H        K    K     Y         Q       K
Kiwi (49%)           C    H        K    K     Y         E       N
Rice (51%)           C    H        K    K     Y         E       R
Sweet potato (49%)   C    H        K    K     Y         E       K
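
The conservation statements made for Table 4 can be checked
mechanically. The following Python sketch is offered as an
illustration under stated assumptions: the names COLUMNS,
ALIGNMENT and conservation are hypothetical, each string simply
transcribes one row of Table 4, and "conserved" is taken to mean
identical to the papaya (papain) residue at that position.

```python
# Column order follows Table 4: two clusters, then the two isolated residues.
COLUMNS = ["C25", "H159", "K17", "K174", "Y186", "E52", "R96"]

# One string per species; character i is the residue aligned to COLUMNS[i].
ALIGNMENT = {
    "Papaya": "CHKKYER",
    "Alder": "CHKKYEN",
    "Arabidopsis": "CHKKYQS",
    "Barley": "CHKKYEN",
    "Ice plant": "CHKKYQK",
    "Kiwi": "CHKKYEN",
    "Rice": "CHKKYER",
    "Sweet potato": "CHKKYEK",
}


def conservation(alignment, reference="Papaya"):
    """Fraction of non-reference species matching the reference at each column."""
    ref = alignment[reference]
    others = [seq for name, seq in alignment.items() if name != reference]
    result = {}
    for i, column in enumerate(COLUMNS):
        matches = sum(seq[i] == ref[i] for seq in others)
        result[column] = matches / len(others)
    return result


if __name__ == "__main__":
    for column, fraction in conservation(ALIGNMENT).items():
        print(f"{column:5s} conserved in {fraction:.0%} of the other species")
```

Under this definition the five cluster residues score 100%, E52
matches in five of the seven other species, and R96 matches only
in rice.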



EXAMPLE IX

Characterization of protein function
applying THEMATICS to serine proteases

The specific serine protease Kex2 was compared to the
structurally related non-specific serine protease subtilisin
Carlsberg. These two proteins have identical catalytic residues
but one has specificity determinants that the other protein lacks.
THEMATICS identifies the catalytic residues for both specific and
non-specific proteases and also identifies the recognition
residues for a specific protease. The ability to identify sites
that govern recognition opens the door to better understanding of
specificity and to the design of highly specific inhibitors.

The active site of an enzyme in the serine protease family
possesses the "catalytic triad," an acid, a histidine, and a
serine, where the acid is an aspartate or glutamate side chain,
the histidine acts as a catalytic base, and the serine serves as a
nucleophile [Holmquist 2000; Di Cera et al. 2003]. This catalytic
triad performs hydrolysis on peptide bonds. In the case of a
specific protease, the protein recognizes the side chain of one or
more residues adjacent to the peptide bond to be hydrolyzed. The
residue that provides the carbonyl carbon atom of the reactive
peptide bond is labeled P1. The pocket in the protease that
recognizes the side chain of the P1 residue is called the S1
pocket. Similarly, the next residue to the N-terminal side of the
substrate protein is labeled P2 and is recognized by the S2
site of the protease, and so on.

Kex2 (kexin; E.C. 3.4.21.61) is a calcium-dependent
transmembrane protease found in the yeast Saccharomyces
cerevisiae. The name derives from the killer phenotype of Kex2
mutant cells. In Saccharomyces cerevisiae, Kex2 is required for
the production and secretion of mature alpha-mating factor and
killer toxin, among other activities. Proteolysis occurs at
paired basic residues.

Kex2 is the prototype of a large family of eukaryotic
pro-protein processing proteases that includes furin, the PCs, and
PACE4 in mammals. This family of proteases is responsible for
the processing of neuropeptides and peptide hormones,
proinsulin, coagulation factors, many growth hormones and
their receptors, Alzheimer-related secretases, and
cancer-associated extracellular matrix proteinases. In addition, furin
is known to function in embryogenesis and homeostasis, and is
required for the activation of many bacterial toxin precursors
and virus envelope glycoproteins.

Kex2 specifically cleaves peptide bonds where P1 and P2 are
the basic residues arginine and lysine. (Kex2 cleaves
preferentially after RR and KR.) Kex2 belongs to a family of
proteases structurally related to the subtilisins. However,
because subtilisin is a very non-specific protease, the structural
similarity does not account for the unusual specificity observed
for Kex2. Inhibited structures of the catalytic and P domains of
Kex2 [Holyoak et al. 2003] and furin [Henrich et al. 2003] have
been determined, and show the source of the calcium dependence
and the nature of the high specificity.

Subtilisin is mechanistically and structurally related to
Kex2. However, subtilisin is non-specific and cleaves peptide
bonds in proteins with little regard for the nature of the
surrounding residues, except for a small preference for
hydrophobic residues. Subtilisin is used commercially in
laundry detergents to degrade proteinaceous material into small,
soluble fragments.

THEMATICS calculations were performed on the 2.4 Å
resolution structure of Kex2 from yeast [Holyoak et al. 2003],
Protein Data Bank (PDB) [Berman et al. 2000] code 1OT5, and on
the 1.2 Å resolution structure of subtilisin Carlsberg from
Bacillus subtilis [Bode et al. 1987], PDB code 1CSE. Table 5 shows
the results of these calculations for the two featured proteins plus
some additional enzymes chosen to highlight different types of
recognition. THEMATICS positive residues for these examples are
known to be involved in catalysis and recognition.

THEMATICS positive residues for Kex2 were found to be:
[D175, D176, D210, D211, E220, H213, H381, Y212], [D277, D320,
E350], [D184], [Y308], where residues that are clustered
together in coordinate space are shown together in square
brackets and where known catalytic residues are shown in
boldface and known recognition residues are shown in italics.
The first and largest cluster includes the ionizable catalytic
residues D175 and H213. This same cluster also includes the S2
recognition residues, D176, D210 and D211. The second cluster
contains the S1 recognition residues D277, D320 and E350. D184
and Y308 are each isolated in space and their functional
significance is currently unknown.
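
The bracket notation above groups residues that lie close together in coordinate space. As an illustration only, the Python sketch below groups residues by single-linkage clustering with a distance cutoff. The clustering rule, the cutoff value, and the coordinates are all assumptions made for this example; they are not taken from the Kex2 structure or from the criterion actually used in the calculations.

```python
import math


def cluster_residues(coords, cutoff):
    """Group residues so that each one lies within `cutoff` of another
    member of its group (single-linkage clustering; an assumed rule)."""
    unvisited = set(coords)
    clusters = []
    for name in coords:
        if name not in unvisited:
            continue
        unvisited.remove(name)
        stack = [name]
        members = []
        while stack:
            current = stack.pop()
            members.append(current)
            # Collect neighbours first, then remove them from the pool.
            near = [other for other in unvisited
                    if math.dist(coords[current], coords[other]) <= cutoff]
            for other in near:
                unvisited.remove(other)
                stack.append(other)
        clusters.append(sorted(members))
    return clusters


if __name__ == "__main__":
    # Hypothetical coordinates in angstroms, NOT taken from PDB entry 1OT5.
    toy_coords = {
        "D175": (0.0, 0.0, 0.0),
        "H213": (3.0, 0.0, 0.0),
        "D176": (6.0, 0.0, 0.0),
        "Y308": (50.0, 0.0, 0.0),
    }
    print(cluster_residues(toy_coords, cutoff=4.0))
```

With these made-up coordinates the first three residues chain together into one cluster and the fourth stays isolated, mirroring the way an isolated positive such as [Y308] is reported on its own.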

THEMATICS positive residues for subtilisin were found to be
[D32, H64], the acid and histidine residues of the known
catalytic triad, and D41 (presumed to be an isolated false
positive). No residues indicative of specificity, analogous to
those in the first two clusters for Kex2, were found.

Table 5 also summarizes THEMATICS results for some
additional enzymes. The last two enzymes in the table are active
as dimers, and the calculations for these were performed on the
dimer structure. In all of these enzymes, the method finds at
least one residue involved in catalysis and, except for the
non-specific subtilisin, at least one residue involved in
recognition. Overall, these enzymes represent a variety of
different kinds of chemistry and topology and different types of
recognition. Kex2 utilizes the ionizable side chains of the
acidic residues aspartate and glutamate to recognize arginine in
the S1 pocket and either arginine or lysine in the S2 pocket;
THEMATICS identifies residues in both of these pockets. For
arginine kinase [Zhou et al. 1998], THEMATICS identifies a set
of arginine residues that recognize a phosphate group. Creatine
amidinohydrolase [Coll et al. 1990] has two glutamate side
chains that recognize a guanidinium group of the substrate, and
these glutamate residues are found to be positive by THEMATICS.
THEMATICS also identifies as positive the residues in the site
where the halogen is recognized in L-2-haloacid dehalogenase [Li
et al. 1998; Ridder et al. 1999].
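
To make the Kex2 recognition rule concrete (arginine in the S1 pocket, arginine or lysine in the S2 pocket), the Python sketch below scans a substrate sequence for positions where the P1 residue is Arg and the P2 residue is Arg or Lys. It is an illustration of the P1/P2 nomenclature only; the function name and the test sequence are hypothetical, and real substrate preference depends on more than these two positions.

```python
def dibasic_sites(sequence):
    """Return 1-based positions of candidate P1 residues.

    A position qualifies when P1 is Arg (R) and P2, the residue
    immediately N-terminal to it, is Arg (R) or Lys (K). Cleavage
    would occur on the C-terminal side of each reported P1 residue.
    """
    sites = []
    for i in range(1, len(sequence)):
        p2, p1 = sequence[i - 1], sequence[i]
        if p1 == "R" and p2 in ("R", "K"):
            sites.append(i + 1)  # convert 0-based index to 1-based position
    return sites


if __name__ == "__main__":
    # Hypothetical peptide, chosen only to exercise the rule.
    print(dibasic_sites("AAKRGGRRAARKA"))
```

In the hypothetical peptide, the K-R and R-R pairs qualify, while the R-K pair does not because lysine would occupy P1.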

Table 5. THEMATICS results of positive residues that are known catalytic residues (bold) or
known recognition residues (italicized), with residues in physical proximity in the protein shown
together in square brackets. Prime indicates a residue from another subunit in a dimer.

PDB ID / Name                        Chemistry               Topology

1OT5  Kex2                           serine protease         subtilisin + jelly roll
      catalytic residues: D175, H213
      basic side chain recognition: D277, D320, E350, D176, D210, D211
      all positives: [D175, D176, D210, D211, E220, H213, H381, Y212], [D277, D320, E350],
      [D184], [Y308]

1CSE  Subtilisin Carlsberg           serine protease         subtilisin
      catalytic residues: D32, H64
      all positives: [D32, H64], [D41]

1BG0  Arginine kinase                phosphate transfer      creatine kinase
      catalytic residues: E225
      phosphate recognition: R126, R229, R280, R309
      all positives: [R126, R229, D226, E224, E225], [R124, R280, R309, R330], [C127], [E335],
      [H185], [Y134], [Y145]

1CHM  Creatine amidinohydrolase      C-N bond hydrolysis     unique
      catalytic residues: H232
      guanidinium recognition: E262, E358
      all positives: [R64', R335, D83', E262, E358, H232, H324, H376], [D217, H33r, Y48],
      [E124], [Y54], [Y257]

1QQ5  L-2-Haloacid dehalogenase      dehalogenation          unique
      catalytic residues: D8, D176, K147
      recognition residues for halide: R39
      all positives: [R39, D8, D176, K147, Y153], [R217', E44, K41, Y45, Y68], [Y223]



EXAMPLE X

Application of statistical analysis to
predicted titration curves for HPPK

The enzyme HPPK (6-hydroxymethyl-7,8-dihydropterin
pyrophosphokinase), a bacterial kinase, is discussed above in
Example V. We illustrate how statistical analysis can be used
to select the residues with anomalous titration behavior, using
the complete set of predicted titration curves C(pH) for the
E. coli HPPK structure from the PDB (1HKA). We analyze the
third and fourth moments of the first derivative function of the
titration curves for each of the ionizable residues. In this
example, we analyze all of the ionizable residues together. (In
a variation of this procedure, we compare only residues of the
same type, so that arginines are analyzed separately, as are
aspartates, etc.) The absolute value of the third moment and
the fourth moment are two of the preferred measures of
the degree of deviation from normal behavior. Table 6 shows the
residues with the highest absolute value of the third moment.
This enzyme has 45 ionizable residues; the ten with the highest
third moment magnitudes are shown. Residues are listed in rank
order, with the highest value first. For each residue, a label
is given, indicating whether the residue is known to be in the
active site or in the second shell. The label "No" indicates
that it has no known functional role. Residues above the solid
line in the table have an absolute value of the third moment
that is more than one standard deviation away from the mean
value (Mean = 0.31; σ = 0.75). Notice that the active site
residues, shown in boldface, tend to be at the top of the list.
Second shell residues are shown in italics. Notice that, if the
absolute value of the third moment is used as the sole
criterion for selection and the cut-off is set at greater than
one standard deviation from the mean, this statistical procedure
finds the active site cluster [D97, H115], consisting of the two
known catalytic residues, and one isolated false positive, E77.
However, this procedure misses D95, which is located in the
active site pocket and might play a role in catalysis.
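
As an illustration of the quantities analyzed here, the Python sketch below computes third and fourth moments of the first derivative of a titration curve. It rests on assumptions that are not stated in this example: the derivative magnitude |dC/dpH| is treated as a distribution over pH, the moments are taken as standardized central moments (divided by the appropriate power of the variance), and the pH grid is uniformly spaced. The function names and the pKa used in the demonstration are hypothetical, and the normalization used for Tables 6 and 7 may differ.

```python
def hh_charge(pH, pKa):
    """Mean charge of an acidic group obeying the Henderson-Hasselbalch equation."""
    return -1.0 / (1.0 + 10.0 ** (pKa - pH))


def derivative_moments(pH_grid, charges):
    """Return (third, fourth) standardized central moments of |dC/dpH|.

    Assumes a uniformly spaced pH grid, so plain sums stand in for integrals.
    """
    xs, f = [], []
    for i in range(1, len(pH_grid) - 1):
        # Central-difference estimate of the slope at an interior grid point.
        slope = (charges[i + 1] - charges[i - 1]) / (pH_grid[i + 1] - pH_grid[i - 1])
        xs.append(pH_grid[i])
        f.append(abs(slope))
    total = sum(f)
    weights = [value / total for value in f]
    mean = sum(w * x for w, x in zip(weights, xs))
    var = sum(w * (x - mean) ** 2 for w, x in zip(weights, xs))
    m3 = sum(w * (x - mean) ** 3 for w, x in zip(weights, xs))
    m4 = sum(w * (x - mean) ** 4 for w, x in zip(weights, xs))
    return m3 / var ** 1.5, m4 / var ** 2


if __name__ == "__main__":
    grid = [-15.0 + 0.01 * i for i in range(4501)]  # pH from -15 to 30
    curve = [hh_charge(pH, 4.0) for pH in grid]     # unperturbed acid, pKa 4
    third, fourth = derivative_moments(grid, curve)
    print(f"third moment = {third:.3f}, fourth moment = {fourth:.3f}")
```

Under these assumptions an unperturbed Henderson-Hasselbalch curve has a symmetric derivative, so its third moment is close to zero; a perturbed, asymmetric curve would give a larger magnitude.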






Table 6. HPPK - Residues with Highest Absolute Value of the Third Moment

Residue    Label          Absolute Value of Third Moment

Asp97      Active site    4.63
His115     Active site    1.75
Glu77      No             1.15
--------------------------------------------------------
Arg84      No             0.65
Asp95      Active site    0.59
Tyr53      No             0.57
Tyr40      No             0.33
Arg121     No             0.30
Asp139     No             0.23
Glu67      No             0.23



Table 7 shows the ten residues with the highest values for the
fourth moment. Residues are listed in rank order, with the
highest fourth moment first. Residues above the solid line in
the table have a fourth moment that is
greater than one standard deviation away from the mean value
(Mean = 3.6; σ = 4.1). Again, the active site residues are at
the top of the list.



Table 7. Residues with Highest Fourth Moment

Residue    Label          Fourth Moment

Asp97      Active site    25.9
Asp95      Active site    10.9
His115     Active site     9.8
Glu77      No              9.6
--------------------------------------------------------
Arg84      No              4.9
Glu68      No              4.3
Tyr53      No              4.2
His72      No              3.8
Asp139     No              3.6
Glu141     No              3.4



If the fourth moment is used as the sole criterion for selection
and the cut-off is set at greater than one standard deviation
from the mean, this statistical procedure finds the active site
cluster [D95, D97, H115] and one isolated false positive, E77.

This example illustrates how statistical analysis of the
curve features can be used to identify THEMATICS positive
residues and THEMATICS positive clusters. A number of
variations are possible in the statistical analysis, a few of
which are shown above, that lead to the correct prediction of
active sites.
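
The selection rule used in this example can be sketched in a few lines of Python. The values below are transcribed from Tables 6 and 7, and the means and standard deviations are the ones quoted in the text for all 45 ionizable residues (they are not recomputed from the ten listed rows). The function name `select_outliers` is an illustrative assumption.

```python
# Values transcribed from Table 6 (absolute value of the third moment).
THIRD_MOMENT = {
    "Asp97": 4.63, "His115": 1.75, "Glu77": 1.15, "Arg84": 0.65,
    "Asp95": 0.59, "Tyr53": 0.57, "Tyr40": 0.33, "Arg121": 0.30,
    "Asp139": 0.23, "Glu67": 0.23,
}

# Values transcribed from Table 7 (fourth moment).
FOURTH_MOMENT = {
    "Asp97": 25.9, "Asp95": 10.9, "His115": 9.8, "Glu77": 9.6,
    "Arg84": 4.9, "Glu68": 4.3, "Tyr53": 4.2, "His72": 3.8,
    "Asp139": 3.6, "Glu141": 3.4,
}


def select_outliers(values, mean, sd, n_sd=1.0):
    """Return residues whose value lies more than n_sd standard deviations from the mean."""
    return [name for name, value in values.items()
            if abs(value - mean) > n_sd * sd]


if __name__ == "__main__":
    # Mean and standard deviation as quoted in the text for all 45 residues.
    print(select_outliers(THIRD_MOMENT, mean=0.31, sd=0.75))
    print(select_outliers(FOURTH_MOMENT, mean=3.6, sd=4.1))
```

The third-moment criterion selects Asp97, His115 and Glu77, while the fourth-moment criterion additionally recovers Asp95, in line with the clusters reported above.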



EXAMPLE XI

Application of automated classification
to predicted titration curves

In a neural network (NN) binary classifier, there are
several input nodes, one per numerical feature, and one output
node, whose value determines the class label predicted by the
network. In order to allow the network to compute a nonlinear
function of its input, there is also at least one layer of
hidden nodes. In an NN classifier, each node computes a weighted
sum of its inputs and passes the result through a sigmoid
function, which is designed to be a differentiable approximation
to a threshold function. The weights in the network are
variable parameters that are adjusted to fit the training data.
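
The forward computation described above can be sketched as follows. This is a minimal Python illustration of a network with one hidden layer of sigmoid nodes and a single sigmoid output node; the weights, biases, feature values, and the use of bias terms are hypothetical, and no training step is shown.

```python
import math


def sigmoid(z):
    """Differentiable approximation to a threshold function."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(features, hidden_weights, hidden_biases, output_weights, output_bias):
    """Compute the output-node value of a one-hidden-layer network.

    Each node forms a weighted sum of its inputs (plus a bias term,
    an assumption of this sketch) and passes it through the sigmoid.
    """
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


if __name__ == "__main__":
    # Hypothetical network: 3 input features, 2 hidden nodes, 1 output node.
    score = forward(
        features=[0.2, 1.5, -0.7],
        hidden_weights=[[1.0, -1.0, 0.5], [0.5, 0.5, 2.0]],
        hidden_biases=[0.0, 0.1],
        output_weights=[2.0, -3.0],
        output_bias=0.0,
    )
    label = "positive" if score > 0.5 else "negative"
    print(f"output = {score:.3f} -> {label}")
```

Thresholding the output at 0.5 to assign the class label is one common convention and is an assumption here.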

We have applied neural networks (NN) and support vector
machines (SVM) to classify the predicted titration curves of
residues as either positive (perturbed) or negative (ordinary).
In these methods, the classifier is first trained on a set of
training data for which a human has already performed this
classification. The machine is then given a previously unseen
example to classify.

The training data come from the visual classification of
titration curve shapes by a human observer. The observer
classifies each titration curve as either positive (perturbed
in a manner indicative of active site involvement) or
negative (ordinary Henderson-Hasselbalch). The major
results of our machine learning so far using neural networks are
as follows. The charge curves for 21 proteins were used in a
neural net analysis for training and testing. These proteins are
acetylcholine esterase, aldose reductase, HIV-1 protease,
6-hydroxymethyl-7,8-dihydropterin pyrophosphokinase (HPPK),
triosephosphate isomerase (TIM), papain, pepsin, glutamate
racemase, adenosine kinase, ACC synthase (ACCS), alanine
racemase, aspartate aminotransferase, phosphomannose isomerase
(PMI), colicin E3, subtilisin Carlsberg, ketosteroid isomerase
(KSI), phosphoglucose isomerase (PGI), chorismate mutase,
apurinic/apyrimidinic endonuclease (Ape1), germin and galactose
mutarotase. These proteins have a total of 2,512 ionizable
residues.

These tests used networks having 16 input nodes and a
single hidden layer with 4-6 nodes. In each case, classifiers
were created using two different strategies: first using the
positive training examples as given, and then replicating the
positive examples multiple times to make up for the fact that
there are many more negatives than positives.
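
The second strategy, replicating the positive examples, can be sketched as below. This Python fragment is illustrative only: the function name, the default choice of replication factor (the integer ratio of negatives to positives), and the toy data are assumptions, since the number of replications actually used is not stated.

```python
def replicate_positives(examples, labels, copies=None):
    """Return a training set in which each positive example appears `copies` times.

    Labels are 1 (positive) or 0 (negative). If `copies` is None, it is set
    to the integer ratio of negatives to positives (an assumed default).
    """
    positives = [x for x, y in zip(examples, labels) if y == 1]
    negatives = [x for x, y in zip(examples, labels) if y == 0]
    if not positives:
        return list(examples), list(labels)
    if copies is None:
        copies = max(1, len(negatives) // len(positives))
    new_examples = negatives + positives * copies
    new_labels = [0] * len(negatives) + [1] * (len(positives) * copies)
    return new_examples, new_labels


if __name__ == "__main__":
    # Toy data: two positive and six negative residues (hypothetical names).
    names = ["p1", "p2", "n1", "n2", "n3", "n4", "n5", "n6"]
    marks = [1, 1, 0, 0, 0, 0, 0, 0]
    balanced_names, balanced_marks = replicate_positives(names, marks)
    print(balanced_marks.count(1), "positives,", balanced_marks.count(0), "negatives")
```

Replication leaves the set of distinct examples unchanged; it only increases the weight that the positives carry during training.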

For these studies a separate network was trained for each
ionizable residue type (Arg, Asp, Cys, Glu, His, Lys, Tyr). In
each case, twenty proteins were used for training and one was held
out for testing. The machine learning results for the test
protein either duplicate the results obtained by the human
observer, or they find the positive residues of the observer plus
something that the observer originally overlooked, thus
outperforming the human observer.
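
The hold-out scheme described above (train on twenty proteins, test on the remaining one) can be sketched as a simple generator. The function name is an assumption, and the short protein list is a subset used only for illustration.

```python
def leave_one_protein_out(proteins):
    """Yield (training_proteins, held_out_protein) pairs, one per protein."""
    for held_out in proteins:
        training = [p for p in proteins if p != held_out]
        yield training, held_out


if __name__ == "__main__":
    # Illustrative subset of the 21 proteins named in this example.
    subset = ["TIM", "papain", "pepsin"]
    for training, held_out in leave_one_protein_out(subset):
        print(f"train on {training}, test on {held_out}")
```

Splitting by protein rather than by residue keeps every residue of the test protein out of the training data.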

One example where the automated classifier outperformed the
human observer is triosephosphate isomerase (TIM). TIM was held
out and training was performed on the remaining 20 proteins. When
presented with the titration curve data for TIM, the automated
classifier found Lys13 to be a positive residue, in addition to
the residues found to be positive by the human observer and by
statistical analysis. Lys13 is an experimentally confirmed active
site residue (Zhang 1994) that was originally overlooked by the
human observers. Thus, the automated classifier gives a more
complete list of active site residues than the simple visual or
statistical classification methods. Similar results were obtained
using SVMs.

EXAMPLE XII

Characterization of active sites with THEMATICS: Application to
some Vitamin B6 Dependent Enzymes

Useful information is derived from a comparison of THEMATICS
results for enzymes with functional similarities and differences.
Here we compare results for three vitamin B6 (PLP) dependent
enzymes: alanine racemase (AlaR), D-amino acid aminotransferase
(DaAT), and aspartate aminotransferase (AspAT). These three
proteins represent three different fold types and three
independent evolutionary lineages. They also represent two
different types of chemistry, racemization and transamination.

All of the catalytic bases for the three enzymes (K39 and
Y265 in AlaR, K145 in DaAT, and K258 in AspAT) have highly
perturbed titration curves, capable of amphoteric behavior over a
wide pH range. This means that they can serve as proton acceptors
and then release the proton to regenerate the base. The titration
curves for these catalytic bases have similar shapes.

A number of other interesting similarities are found for the
three proteins. Each of the catalytic lysine residues (K39 in
AlaR, K145 in DaAT, and K258 in AspAT) has a nearby residue with a
heavily downshifted pKa (D313' in AlaR, E32 in DaAT, and Y70' in
AspAT). All three enzymes have histidine residues with
downshifted pKa's located near the pyridine ring of the PLP (H166
in AlaR; H100 in DaAT; and H143 and H189 in AspAT). These
similarities appear to be markers for PLP dependency.

From the point of view of function prediction, the
differences between alanine racemase and the two transaminases are
especially of interest. It has been noted previously that the
transaminases have an acid group near the nitrogen atom of the PLP
(E177 in DaAT and D222 in AspAT), whereas the residue closest to
the ring nitrogen atom in alanine racemase is R219. We now find
that E177 in DaAT and D222 in AspAT are characterized by strongly
downshifted pKa's. This means that in the two transaminases, the
nitrogen atom of the PLP ring is likely to remain protonated. The
interaction between the pyridinium ion and the side chain
carboxylate group thus may be characterized as an ion pairing
interaction with the proton strongly localized on the PLP nitrogen
atom. The proximity of a very-low-pKa acid to the pyridinium
nitrogen atom therefore is likely to keep the proton inside a
relatively deep potential well, and the pyridinium nitrogen end of
the PLP ring is therefore able to serve as an electron sink during
transamination. It has been noted previously that, since the
closest group to the nitrogen atom of the PLP ring in alanine
racemase is R219, the ring nitrogen atom is more difficult to
protonate. We now have the additional information that R219 of
alanine racemase has a very upshifted pKa and, therefore, retains
its positive charge over a wider pH range and binds the proton
more strongly than a more typical arginine side group. These
properties related to pH-dependent behavior appear to serve as
markers for the specific types of chemistry that a B6-dependent
enzyme can catalyze.

Another very interesting difference between the racemase and
the two transaminases pertains to the environment of the phenolic
oxygen atom on the PLP ring. In alanine racemase, the closest
group to this oxygen atom in the external aldimine is the side
chain of Arg136, which was found to have a very high pKa. Since
R136 is thus expected to remain protonated over a wide pH range,
the deprotonated form of the phenolic O atom of PLP is stabilized
at physiological pH. A negatively charged phenolate on the PLP
ring would tend to prevent the PLP from serving as an electron
sink. Thus the environment around the phenol group of the PLP
affects the PLP in the same direction as does the environment
around the pyridine nitrogen atom: both tend to destabilize negative
charge in the PLP ring. On the other hand, in the two
transaminases, the closest groups to the phenolic oxygen atom of
PLP are residues with very elongated titration curves (Y31 and
K145 in DaAT; Y225 in AspAT). The low pKa's of Y31 in DaAT and
Y225 in AspAT suggest that the phenolic oxygen atom of the
external aldimine is protonated, which would tend to help the PLP
ring serve as an electron sink. Because the phenolic oxygen atom
is surrounded by soft residues, it is possible that it undergoes
proton transfer during the transamination reaction. These are
additional markers of the specific types of chemistry for a B6
enzyme and thus can help to identify similar chemistry in a
protein of unknown function.

In this fashion, the predicted pH-dependent properties
(titration curve shapes and the calculated pKa's) of the ionizable
residues in a protein can be used to identify specific types of
chemistry. We have identified markers of PLP dependence. We have
also identified markers to distinguish between a racemase and a
transaminase.

While the present invention has been described in
conjunction with a preferred embodiment, one of ordinary skill,
after reading the foregoing specification, will be able to effect
various changes, substitutions of equivalents, and other
alterations to the compositions and methods set forth herein. It
is therefore intended that the protection granted by Letters
Patent hereon be limited only by the definitions contained in the
appended claims and equivalents thereof.